Video Feature Extraction¶
Now that we have our training and testing tensors, we can start extracting relevant features from the video data.
Importing Libraries and Data¶
import os
import numpy as np
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
from scipy.signal import convolve2d
import tensorly as tl
import gc
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
train_test_data_dir = os.path.join(os.getcwd(), 'Train_Test_Data')
train_videos = np.load(os.path.join(train_test_data_dir, 'train_videos.npy'))
test_videos = np.load(os.path.join(train_test_data_dir, 'test_videos.npy'))
y_train = np.load(os.path.join(train_test_data_dir, 'y_train.npy'))
y_test = np.load(os.path.join(train_test_data_dir, 'y_test.npy'))
id_train = np.load(os.path.join(train_test_data_dir, 'id_train.npy'))
id_test = np.load(os.path.join(train_test_data_dir, 'id_test.npy'))
print(train_videos.shape)
print(test_videos.shape)
print(y_train.shape)
print(y_test.shape)
print(id_train.shape)
print(id_test.shape)
labels = ['Neutral', 'Calm', 'Happy', 'Sad', 'Angry', 'Fearful', 'Disgust', 'Surprised']
(240, 426, 3, 50, 960)
(240, 426, 3, 50, 480)
(960,)
(480,)
(960, 7)
(480, 7)
Breakdown of Custom 3D Histogram of Oriented Gradients for Dimensionality Reduction¶
The formulation of this Histogram of Oriented Gradients algorithm is loosely based on recent research in the field of computer vision; however, this project extends the approach to three dimensions. Here's an overview of the methodology, which is applied to both the training and testing datasets:
1. Iterate through every sample, convert the frames to grayscale, and (optionally) apply the following Gaussian filter to each frame of the video ($V$); a code sketch of this step follows the list:
$$ \text{Gaussian Filter} \Longrightarrow \frac{1}{1115}\begin{pmatrix} 1 & 4 & 7 & 10 & 7 & 4 & 1 \\ 4 & 12 & 26 & 33 & 26 & 12 & 4 \\ 7 & 26 & 55 & 71 & 55 & 26 & 7 \\ 10 & 33 & 71 & 91 & 71 & 33 & 10 \\ 7 & 26 & 55 & 71 & 55 & 26 & 7 \\ 4 & 12 & 26 & 33 & 26 & 12 & 4 \\ 1 & 4 & 7 & 10 & 7 & 4 & 1 \end{pmatrix} $$
2. We then compute the gradients with respect to the height, width [1], and frames ($\frac{\partial V}{\partial x}$, $\frac{\partial V}{\partial y}$, $\frac{\partial V}{\partial z}$).
3. Using these gradients, we can compute the three dimensional gradient magnitude for each video. $$ G = \sqrt{\left( \frac{\partial V}{\partial x}\right)^2 + \left( \frac{\partial V}{\partial y} \right)^2 + \left( \frac{\partial V}{\partial z}\right)^2} $$
4. For images ($I$), the gradient direction is generally computed as $\theta = \arctan \left( \frac{\frac{\partial I}{\partial y}}{\frac{\partial I}{\partial x}} \right)$ [2]. Since we have videos, we must compute both the azimuthal angle and the polar angle to capture the 3D feature space [4]:
$$ \theta_{azimuth} = \arctan \left( \frac{\frac{\partial V}{\partial y}}{\frac{\partial V}{\partial x}} \right) $$
$$ \phi_{polar} = \arctan \left( \frac{\sqrt{\left( \frac{\partial V}{\partial x}\right)^2 + \left( \frac{\partial V}{\partial y} \right)^2}}{\frac{\partial V}{\partial z}} \right) $$
5. With our three sets of features, we can now partition the video into cells [3]. In our case, the cell size is ($5$, $6$, $5$), which groups sets of $5 \times 6 \times 5 = 150$ pixels together.
6. Within each cell, we assign the gradient magnitude of each pixel ($G_{(i, j, k)}$) to bins based on its azimuthal and polar angles. With $9$ bins per angle, we sum the gradient magnitudes of all the pixels belonging to each bin, which reduces the dimensionality from $150$ points in each cell to $9 \times 2 = 18$ points per cell (see the binning sketch after the 2D illustration below). We can then save these results to disk.
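As a minimal sketch of the optional smoothing in step 1, the kernel above can be applied frame by frame with scipy.signal.convolve2d (imported earlier). The kernel entries come directly from the matrix above; the helper name smooth_video and the symmetric boundary handling are illustrative assumptions, not necessarily the exact implementation:
# Hypothetical helper for step 1: smooth each frame of a (height, width, frames)
# grayscale video with the 7x7 Gaussian kernel defined above.
gaussian_kernel = np.array([
    [ 1,  4,  7, 10,  7,  4,  1],
    [ 4, 12, 26, 33, 26, 12,  4],
    [ 7, 26, 55, 71, 55, 26,  7],
    [10, 33, 71, 91, 71, 33, 10],
    [ 7, 26, 55, 71, 55, 26,  7],
    [ 4, 12, 26, 33, 26, 12,  4],
    [ 1,  4,  7, 10,  7,  4,  1],
]) / 1115  # the entries sum to 1115, so the kernel sums to 1

def smooth_video(gray_video):
    smoothed = np.empty_like(gray_video, dtype=float)
    for f in range(gray_video.shape[2]):
        # 'same' keeps the frame size; 'symm' mirrors pixels at the borders
        smoothed[:, :, f] = convolve2d(gray_video[:, :, f], gaussian_kernel,
                                       mode='same', boundary='symm')
    return smoothed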
Example of HOG Features (in 2D):

Source: https://www.sciencedirect.com/topics/computer-science/histogram-of-oriented-gradient
We follow a similar approach as outlined in the image above, but in 3D.
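Before visualizing these features, here is a minimal sketch of the partitioning and binning in steps 5 and 6, assuming the ($5$, $6$, $5$) cell size and $9$ bins per angle stated above. The helper name hog_3d_features and the evenly spaced bins over each angle's range ($(-\pi, \pi]$ for the azimuth, $[0, \pi]$ for the polar angle) are our own assumptions:
# Hypothetical helper for steps 5-6: per cell, sum gradient magnitudes into
# 9 azimuthal bins and 9 polar bins, giving 18 values per cell.
def hog_3d_features(magnitude, azimuthal_angle, polar_angle, cell=(5, 6, 5), n_bins=9):
    h, w, f = magnitude.shape
    ch, cw, cf = cell
    # Map each pixel's angles to bin indices (clipping puts the max angle in the last bin)
    az_bins = np.clip(((azimuthal_angle + np.pi) / (2 * np.pi) * n_bins).astype(int), 0, n_bins - 1)
    po_bins = np.clip((polar_angle / np.pi * n_bins).astype(int), 0, n_bins - 1)
    features = np.zeros((h // ch, w // cw, f // cf, 2 * n_bins))
    for i in range(h // ch):
        for j in range(w // cw):
            for k in range(f // cf):
                sl = (slice(i * ch, (i + 1) * ch),
                      slice(j * cw, (j + 1) * cw),
                      slice(k * cf, (k + 1) * cf))
                mags = magnitude[sl].ravel()
                # Sum the cell's gradient magnitudes falling into each angular bin
                features[i, j, k, :n_bins] = np.bincount(az_bins[sl].ravel(), weights=mags, minlength=n_bins)
                features[i, j, k, n_bins:] = np.bincount(po_bins[sl].ravel(), weights=mags, minlength=n_bins)
    return features.ravel()
For the ($240$, $426$, $50$) grayscale videos used here, this yields $48 \times 71 \times 10$ cells of $18$ values each.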
Here's an example of what the gradient magnitude, azimuthal angle, and polar angle look like for the videos:¶
sample = 72  # random video from the training data
gray_frames = tl.tenalg.mode_dot(train_videos[:, :, :, :, sample], np.array([0.2989, 0.5870, 0.1140]), mode=2)
gx = np.gradient(gray_frames, axis=0)
gy = np.gradient(gray_frames, axis=1)
gz = np.gradient(gray_frames, axis=2)
magnitude = np.sqrt(gx**2 + gy**2 + gz**2)
azimuthal_angle = np.arctan2(gy, gx)
polar_angle = np.arctan2(np.sqrt(gx**2 + gy**2), gz)
del gx, gy, gz
gc.collect()
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 7))
axes = axes.flatten()
im1 = axes[0].imshow(gray_frames[:, :, 0], cmap='gray')
im2 = axes[1].imshow(magnitude[:, :, 0], cmap='gray')
im3 = axes[2].imshow(azimuthal_angle[:, :, 0], cmap='gray')
im4 = axes[3].imshow(polar_angle[:, :, 0], cmap='gray')
axes[0].set_title('Grayscale Video')
axes[1].set_title('Gradient Magnitude')
axes[2].set_title('Azimuthal Angle')
axes[3].set_title('Polar Angle')
def update(i):
    im1.set_array(gray_frames[:, :, i])
    im2.set_array(magnitude[:, :, i])
    im3.set_array(azimuthal_angle[:, :, i])
    im4.set_array(polar_angle[:, :, i])
    return [im1, im2, im3, im4]
plt.tight_layout()
plt.close(fig)
fig.suptitle(f'Example features for {labels[y_train[sample]-1]} emotion')
ani = FuncAnimation(fig, update, frames=gray_frames.shape[2]//2, blit=False, interval=100)
HTML(ani.to_jshtml())
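Putting the pieces together, a full extraction pass over a dataset might look like the sketch below, which reuses the hypothetical smooth_video and hog_3d_features helpers from above; the output filenames are illustrative, not the project's actual ones:
# Illustrative end-to-end pass over one dataset (assumed helpers and filenames)
rgb2gray = np.array([0.2989, 0.5870, 0.1140])

def extract_hog_features(videos):
    feats = []
    for s in tqdm(range(videos.shape[4])):
        # Steps 1-4: grayscale, (optional) smoothing, gradients, magnitude, angles
        gray = smooth_video(tl.tenalg.mode_dot(videos[:, :, :, :, s], rgb2gray, mode=2))
        gx, gy, gz = (np.gradient(gray, axis=a) for a in range(3))
        mag = np.sqrt(gx**2 + gy**2 + gz**2)
        az = np.arctan2(gy, gx)
        po = np.arctan2(np.sqrt(gx**2 + gy**2), gz)
        # Steps 5-6: cell partitioning and angular binning
        feats.append(hog_3d_features(mag, az, po))
    return np.stack(feats)

# np.save(os.path.join(train_test_data_dir, 'X_train_hog.npy'), extract_hog_features(train_videos))
# np.save(os.path.join(train_test_data_dir, 'X_test_hog.npy'), extract_hog_features(test_videos))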